SUMMARY


As part of an extensive job performance measurement research and development program, the Air
Force Human Resources Laboratory has developed a new methodology called Walk-Through Performance
Testing (WTPT). WTPT is a task-level job performance measurement system which combines hands-on
testing and interview testing to provide a high-fidelity measure of an individual's technical job
competency. This document contains a series of papers, originally presented at the 92nd
Convention of the American Psychological Association, which outline: (a) the conceptual
frame-of-reference within which the original planning for WTPT took place, (b) the WTPT methodology and
the rationale and overall approach to hands-on and interview test development, (c) the sampling
strategy used to select tasks for work sample development, and (d) the approach used to analyze
selected tasks for WTPT development. A final section discusses the implications of the
measurement strategy.




PREFACE 


The  Air  Force  Human  Resources  Laboratory  (AFHRL)  is  engaged  in  a  long-term  research 
and  development  effort  to  develop  criteria  for  validation  of  Air  Force  selection  and 
classification  procedures.  Both  work  sample  and  rating  forms  at  varying  levels  of 
specificity  are  currently  being  developed.  The  work  sample  measurement  approach  being 
developed  and  evaluated  by  AFHRL  is  the  topic  of  this  technical  paper. 

The  basic  content  of  these  four  papers  was  presented  originally  at  the  92nd  Annual 
Convention  of  the  American  Psychological  Association  in  Toronto,  Canada.  The  symposium 
was chaired by Dr. Sheldon Zedeck (University of California at Berkeley). Dr. Terry
Dickinson (Old Dominion University) served as symposium discussant. A paper based on
comments provided at the symposium by Dr. Dickinson is included in the report.
WALK-THROUGH PERFORMANCE TESTING:

AN  INNOVATIVE  APPROACH  TO  WORK  SAMPLE  TESTING 

INTRODUCTION 

This  series  of  papers  presents  a  newly  developed  performance  measurement  methodology,  and 
provides  a  detailed  explanation  of  the  rationale,  developmental  process  and  potential  payoffs  of 
such  a  technology.  This  new  approach,  Walk-Through  Performance  Testing  (WTPT),  is  being 
developed to expand the range of job tasks measured, to include tasks that do not lend themselves
to hands-on testing. The first paper, by Gould and Hedge, outlines the general conceptual
performance measurement frame-of-reference within which the original planning for WTPT took
place.  In  the  next  paper,  Hedge  discusses  in  detail  the  WTPT  methodology,  and  the  rationale  and 
overall  approach  to  hands-on  and  interview  test  development.  Lipscomb's  paper  follows  with  an 
explanation  of  the  sampling  strategy  used  to  select  representative  tasks  for  work  sample 
development,  an  initial  step  in  the  development  process.  Next,  Ballentine  and  Lipscomb  present 
the approach used in task analysis for WTPT development. Finally, Dickinson provides comments on
these  papers  and  discusses  implications  of  this  new  measurement  strategy.  Taken  together,  these 
papers  present  the  most  comprehensive  discussion  of  the  WTPT  process  to  date. 


HISTORY, BACKGROUND, AND THEORETICAL BASES OF
WALK-THROUGH  PERFORMANCE  TESTING 

R.  Bruce  Gould 
and 

Jerry  W.  Hedge 

Air  Force  Human  Resources  Laboratory 

This paper describes the Air Force Human Resources Laboratory's (AFHRL) research and
development (R&D) program for development of individual job performance measures. The job
performance measurement literature indicates that most previous efforts have used broad-based
generic indices, performance ratings, or operational measures, with their inherent problems of
inflation and halo effects. These broad measures were unable to take into account
task-level-specific influences such as training differences or differences in opportunities to
perform; hence, such efforts have been largely unsuccessful. However, it appears that current
interest, added resources, and technology developments have now significantly increased the
probability of developing successful measures of job performance to be used as criteria in
evaluating manpower, personnel, and training programs.

Several influences have highlighted the Air Force's need for job performance measurement and
brought ongoing and planned programs to their current state. Planning for the R&D program began
3 years ago, on the recommendation of an AFHRL Research Advisory Panel (composed of knowledgeable
scientists from academia and industry, as well as peers from the Army and Navy).

The panel reviewed the entire AFHRL manpower, personnel, and training R&D program and
recommended consolidation of separate job performance measurement efforts into a single unified
R&D program. At the same time, the Uniform Guidelines for Employee Selection (1978) and a review
of case law mandated that Air Force civilian selection systems be validated against job
performance measures. Military tests are exempted from this legal mandate by the Office of
Management and Budget (OMB), but OMB has been reviewing that exemption. Finally, Congress
mandated that military selection tests be validated against hands-on job performance measures.
These operational, legal, and Congressional imperatives have thus provided the impetus to
planning and obtaining support for a lengthy, high-resource R&D effort.

Twenty years of extensive occupational R&D and a commitment of significant resources provided
the backdrop, data base, and means to solve a portion of the "criterion problem" for Air Force
researchers and program evaluators. In addition, the Air Force has now completed the second year
of a 7-year R&D effort to systematically obtain job performance measures that will serve as
criteria in validating selection systems and in evaluating training programs and the effects of
personnel policies and procedures. Previous R&D concerning Air Force occupations has identified
the major job tasks in enlisted specialties, the types of individuals who perform them, the
relative difficulties in learning to perform them, and the relative aptitude requirements of the
tasks. This occupational data base provides the initial reference point for identifying job
tasks to be measured, as well as objective indices of moderator variables such as task-level
experience which otherwise would contribute error variance to the measurement of job performance.

A conceptually based theoretical framework of performance measurement is presented to
summarize and organize research progress in terms of previous empirical work and to identify
future R&D needs. The present program is unique in its "research purposes only" orientation, its
concentration on individual-job-specific tasks rather than tasks common to all jobs in a
specialty, and its consideration of different types of measures (job sample testing; objective
indices of productivity; and supervisory, peer, and self ratings) as tapping both overlapping and
unique components of the job performance criterion space. A novel job measurement approach
called "walk-through performance testing" is used as the high-fidelity benchmark against which
less time-consuming and expensive procedures can be compared through a successive approximation
research strategy.

The Air Force job performance measurement R&D plan has not been developed in isolation from
the other Services. We served with representatives from the Navy Personnel Research and
Development Center (NPRDC) on the Army Research Institute's (ARI) Armed Services Vocational
Aptitude Battery (ASVAB) validation contract evaluation panel. We also sponsored an informal
tri-Service workshop on job performance evaluation. We are now working with the other Services
to  coordinate  a  Joint-Service  Job  Performance  Measurement  Program.  In  effect,  the  Services  are 
pooling  their  resources  by  dovetailing  research  plans  and  sharing  results. 

The  short-term  objective  of  the  Air  Force's  job  performance  measurement  program  is  the 
development  of  on-the-job  performance  measures  to  validate  Air  Force  selection  and  classification 
procedures.  Guidelines  for  developing  and  obtaining  the  performance  measures  will  be  established 
for  a  wide  range  of  enlisted,  officer,  and  civilian  jobs.  Once  obtained,  the  measures  will  be 
placed  in  a  data  base  for  test  validation  use. 

The  long-term  goal  is  to  establish  an  operational  performance  measurement  program  for  the 
evaluation  of  selection  and  training  procedures  and  personnel  policies  and  practices;  that  is,  to 
operationalize  procedures  such  that  the  performance  measurement,  validation,  and  evaluation  can 
be carried on by technicians. In this way, R&D resources will be freed for other projects.

Conceptual  Performance  Measurement  Model 

The  first  step  in  the  present  effort  was  to  develop  a  conceptually  based  descriptive  model  of 
performance measurement that could be used to summarize and organize R&D progress in terms of
previous  empirical  work,  and  to  identify  and  prioritize  future  research  needs  for  the  program. 
Time  will  not  permit  a  detailed  examination  here  of  model  development,  results  of  the  literature 
review,  and  research  issues  to  be  studied.  These  details  were  described  in  a  separate  report 
(Kavanagh,  Borman,  Hedge,  &  Gould,  1987).  The  development  process,  resulting  model,  and  some 
general  conclusions  will,  however,  be  outlined. 

Guidelines established for model development were that the model should: (a) focus on
performance measurement used in the military; (b) describe performance measurement used for
"research purposes only"; (c) consider all variables, based on the theoretical or empirical
literature, that could affect job performance or performance measurement; (d) use a general
classical test score theory perspective for identifying sources of true and error variance in
observed scores; (e) specify classes of variables rather than detailed individual variables; (f)
be descriptive rather than prescriptive because tested causal linkages to job performance are as
yet too incomplete; and (g) use an iterative process that begins with a general model of job
performance and ends with a model of measurement quality where the measures are to be used as
criteria in validation or evaluation projects only.
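The classical test score perspective named in guideline (d) can be stated compactly. The notation below is the standard textbook form and is not taken from the model itself: an observed score X is treated as the sum of a true score T and an error component E that is assumed to be uncorrelated with T.

```latex
X = T + E, \qquad
\sigma_X^2 = \sigma_T^2 + \sigma_E^2, \qquad
\rho_{XX'} = \frac{\sigma_T^2}{\sigma_X^2}
```

Read this way, each class of variables in the model is a candidate source of either true-score variance or error variance in the observed performance scores.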

First, a general conceptually and empirically based model was developed which identified
individual characteristics that may interact with supervisory and work group factors to influence
job performance. Organizational factors and situational constraints were also included. Next, a
restriction was imposed on the general model that only factors affecting the quality of
performance measurement would be considered, and a second more detailed model resulted. The
emphasis had thus shifted to include not only a person's job performance but also the measurement
method used to record that performance, as well as the characteristics of the person recording
the scores. The experience of job incumbents, including opportunities to perform, remained a
prominent factor. A reasonably exhaustive list of the variables that have been empirically
demonstrated to affect, or potentially affect, performance measurement quality was compiled at
this point and is shown in Table 1. One of the critical variables identified is the measurement
purpose.
Table 1. Variables That Can Impact Measurement Quality


1.  Individual  characteristics 

a.  Cognitive  variables:  rater  or  ratee 

b.  Rater/ratee  intelligence

c.  Rater/ratee  knowledge  of  the  job  being  evaluated 

d.  Rater/ratee  personal  characteristics 

e.  Rater/ratee  interpersonal  trust 

2.  Relationship  between  ratee  and  rater/observer 

a.  Sex  congruence 

b.  Race  congruence 

c.  Job  tenure  together 

d.  Age  congruence 

e.  Off-the-job  relationship 

f.  History  of  conflict  or  cooperation 

3.  Method/Source  of  measurement 

a.  Supervisor  ratings 

b.  Peer  ratings

c.  Self  ratings 

d.  Subordinate  ratings 

e.  Assessment  center  (team)  ratings 

f.  Work  samples/simulations 

g.  Productivity  records 

4.  Scale  development 

a.  Critical  incidents  used 

b.  Job  description/job  requirements-based 

c.  Employee  participation 

d.  Top  management  support  during  development 

5.  Rating  scale  characteristics 

a.  Content  of  the  scale 

b.  Anchors  versus  no  anchors 

c.  Behaviors  versus  traits 

d.  Format  type 

e.  Number  of  anchors/scale  points 

f.  Single  versus  multiple  dimensions 

g.  Scaling  metric/approach 

6.  Performance  standards/goals 

a.  Present  or  not 

b.  Standards  versus  goals 

c.  Participatively  set  and  communicated

d.  Specificity  of  behavior  or  accomplishment  expected

7.  Social  context 

a.  Performance  level  of  others  in  work  group 

b.  Existence  of  group  norms 

c.  Rater's  status  in  group 

d.  Ratee's  status  in  group


8.  Non-work  variables 

a.  Marital  status 

b.  Pre-school  children  at  work 

c.  Dual-career  family 

d.  Participation  in  company  activities  off  the  job 

e.  Stressful  life  events  in  recent  past 

9.  Performance  constraints 

a.  Poor  information 

b.  Equipment  efficiency 

c.  Supplies  deficiency 

d.  Time  limitations 

e.  Poor  work  environment 

10.  Organizational/unit  norms

a.  Upper  management's  expectation  of  certain  level  of  performance 

b.  Immediate  supervisor's  expectation  regarding  level  of  performance 

c.  Presence  of  a  union 

d.  Pay/rewards  tied  to  performance  levels  by  contract 

e.  Pay/rewards  tied  to  performance  levels  by  informal  norms 

11.  Public  relations/administrative  procedures 

a.  Required  or  not 

b.  Mode  of  presentation 

c.  Content  of  procedure 

12.  Rater  training 

a.  Content  of  training 

b.  Format  of  training 

c.  Length  of  training 

13.  Measurement  purpose 

a.  Validation  research  only 

b.  Employee  growth  and  development 

c.  Administrative  purposes  such  as  rewards 

d.  Meeting  legal  guidelines 

14.  Performance  feedback 

a.  Required  or  not 

b.  Sources  of  feedback 

c.  Participative 

d.  Clarity  of  feedback 

e.  Frequency  of  feedback 

15.  Pay-performance  relationship 

a.  Are  they  related  in  the  system? 

b.  Equity  of  the  relationship

Performance measurement systems have four major purposes: (a) administrative decisions, (b)
employee growth and development, (c) validation/evaluation research, and (d) meeting legal
guidelines. Since our purpose was to obtain performance measures for validation R&D, we
eliminated those model components that were not related to that outcome. System characteristics
related to performance feedback and pay-for-performance relationships were eliminated. The
resulting model is shown in Figure 1.


Figure  1.  A  Job  Performance  Measurement  Classification 
Scheme  for  Validation  Research. 

The  model  categorizes  the  measurement  system  characteristics  on  the  left;  the  outcome, 
performance  measurement  quality,  on  the  right;  and  potential  intervening  variables  in  the  center. 

Regarding  Figure  1,  an  analogy  would  be  to  consider  the  system  characteristics  as  independent 
variables,  the  intervening  variables  as  moderators,  and  measurement  quality  as  the  dependent 
variable  in  a  multiple  regression  equation.  One  would  expect  the  beta  weights  to  change  for  the 
various  terms  in  the  equation  as  the  type  of  measurement  method  or  type  of  job  being  measured 
changes.  These  categories  of  independent  and  intervening  variables  provided  a  classification 
scheme by which the empirical literature was organized and the R&D issues identified. In
establishing  a  connection  between  a  system  characteristic  and  the  outcome  (accuracy),  the 
relevant  literature  gives  specific  implications  for  the  measurement  system.  Accuracy  and 
construct  validity  are  the  primary  criteria  by  which  the  dependent  measure,  quality  of 
performance  measurement,  must  be  assessed.  The  reasons  for  including  intervening  variables  in 
the  center  of  the  model  should  become  more  apparent  following  a  discussion  of  the  nature  of  the 
measurement  systems  to  be  used  in  this  effort. 
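The regression analogy can be written out explicitly. This is only an illustrative sketch: the symbols Q (measurement quality), S_k (system characteristics), V_l (intervening variables), and the superscript (m, j) indexing measurement method and job are assumptions introduced here, not notation from the model.

```latex
Q = \beta_0^{(m,j)}
  + \sum_{k} \beta_k^{(m,j)} S_k
  + \sum_{l} \gamma_l^{(m,j)} V_l
  + \varepsilon
```

The superscript expresses the expectation that the weights differ as the measurement method m or the job j changes. A strict treatment of the intervening variables as moderators would also add product terms of the form S_k V_l; they are omitted here for brevity.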

Domain  to  be  Measured  Versus  Measurement  Method 

Let us now consider the conclusions drawn about the relationship between measurement method and
the validity of the resulting measures, and how they were translated into a proposed approach.
Typically, performance measures have suffered varying degrees of criterion deficiency; that is,
the  measures  did  not  adequately  sample  the  domain  of  tasks  to  be  performed  in  specific  jobs. 
Either  the  measures  were  not  based  on  an  adequate  job  analysis,  or  the  representative  tasks  could 
not  be  economically  measured  by  the  method  employed. 

Definition  of  job  tasks  is  not  a  problem  in  the  Air  Force  because  of  its  active  Occupational 
Survey  Program,  which  provides  current  task-level  data  on  all  major  enlisted  specialties.  More 
than  200  of  the  some  250  enlisted  specialties  are  re-surveyed  on  an  average  of  every  4  years. 
The  surveys  provide  task-level  measures  of  time  spent  performing,  time  required  to  learn  to 
perform,  and  relative  aptitude  requirements,  as  well  as  how  the  tasks  are  organized  into 
homogeneous clusters. The clusters of tasks which are performed in concert are called job types.

Identification  of  critical  tasks  in  a  specialty  is  not  difficult,  but  selection  of  tasks  for 
the measurement system immediately poses the question of just what "job" means in job
performance. For example, Figure 2 shows a typical situation where there are relatively unique
as well as overlapping job types within an Air Force specialty. Here, two job types share many
common tasks, most of which are critical tasks, but a third job type shares very few tasks common
to  the  other  two  job  types.  Thus,  although  performance  can  be  measured  at  the  job  level,  how  do 
you  aggregate  across  job  types  to  make  a  collective  statement  about  performance  of  individuals  at 
the  specialty  level? 


Figure 2. Job Performance Domain.

The solution to this problem for the Air Force is to measure both a sample of core
specialty-wide critical tasks and a sample of job-type-specific tasks. Task-level experience
measures  will  also  be  collected  to  assess  the  impact  of  differences  in  opportunities  to  perform, 
such  that  individuals  can  be  compared  on  the  specialty-wide  set  of  core  tasks,  taking  experience 
differences  into  account.  However,  in  order  to  compare  performance  across  job  types  as  a  means 
of  drawing  specialty-wide  conclusions,  a  benchmarking/equating  strategy  must  be  developed. 
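The split between core and job-type-specific tasks can be illustrated with a small sketch. It assumes, purely for illustration, that the specialty-wide core is the set of tasks performed in every job type; the task identifiers and job type names are hypothetical and do not come from any occupational survey.

```python
# Hypothetical task inventories for three job types within one specialty.
# Two job types overlap heavily; the third shares very little with them.
job_types = {
    "job_type_1": {"T01", "T02", "T03", "T04", "T05"},
    "job_type_2": {"T01", "T02", "T03", "T04", "T06"},
    "job_type_3": {"T01", "T07", "T08", "T09"},
}


def core_tasks(inventories):
    """Return the tasks performed in every job type (assumed definition of the core)."""
    return set.intersection(*inventories.values())


def job_type_specific_tasks(inventories):
    """Return, for each job type, the tasks that fall outside the core."""
    core = core_tasks(inventories)
    return {name: tasks - core for name, tasks in inventories.items()}


core = core_tasks(job_types)
specific = job_type_specific_tasks(job_types)
print(sorted(core))
for name in sorted(specific):
    print(name, sorted(specific[name]))
```

With these invented inventories the core reduces to a single task, which shows why a benchmarking or equating strategy is still needed before performance can be compared across job types.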
For the Air Force, hands-on testing constitutes a particular problem because of the nature of
Air Force equipment. In many specialties, the number of tasks that can be measured by hands-on
testing is low, in that the tasks tend to take too long to complete, require replacement of
expensive parts, or risk damage to system components. Thus, a walk-through testing
routine was devised to overcome the stated problems. WTPT is a task-level job performance
measurement system which combines task performance and interview procedures to provide a
high-fidelity measure of individual technical job competence. The test takes place in the work
setting and is a type of show-and-tell procedure.


Development  of  the  Performance  Measures  and  the  Measurement  Process 

The performance measures to be developed range from the micro task-level walk-through test to
very macro, global ratings. From the occupational survey data, specialty- and job-type-specific
critical core tasks will be identified. Subject-matter experts (SMEs) will aid the developers in
dichotomizing the tasks into those which are economically observable and those which must be
measured by interview. The SMEs will then develop the procedures for conducting the observations
and interviews, as well as specify the performance standards for scoring responses. For a given
specialty, a formal testing manual similar to, say, that for an individually administered
intelligence test will be developed in two sections; i.e., a specialty core task section and a
job-type-specific section. There will be an overlap between tasks measured by interview and
hands-on testing, to ensure that the two methods demonstrate similar measurement fidelity and can
be used to fully sample the domain of technical competency skills.

Before  testing,  the  job  incumbent  will  complete  the  self-ratings  of  experience  and 
proficiency.  The  examiner  will  then  take  the  ratee  to  the  work  site  and  administer  the 
walk-through  test,  having  the  ratee  point  out  the  procedures,  explain  component  functioning,  or 
actually  perform  tasks.  Target  test  administration  time  is  8  hours. 

The detailed rating forms will have a one-to-one task correspondence with the tasks in the
walk-through testing manual. In addition, the other forms will contain behavioral dimensions
such as ability to perform troubleshooting or administrative tasks, global performance ratings
such as technical competence and interpersonal/social skills, and ratings that apply Air
Force-wide such as the Leadership and Technical Knowledge factors found in efficiency reports.


Other  Research  Issues 

Considering the nature of the walk-through testing procedure, the reason for the
concentration on the intervening variables in the center of Figure 1 should now be more
apparent. Once the measures are developed, the value of the data obtained is going to be largely
a function of the training, motivation, and ability of the rater/ratee. For this reason, much of
our specific research must be aimed at these areas. In terms of system characteristics, the
rater-ratee relationship, organization/unit norms, and public relations issues will initially be
minimized by the use of professional examiners. However, later in the project, when
non-professional examiners will be considered as possible test administrators, these issues will
have to be addressed.

The measurement method issues have been discussed in an earlier part of this paper. Scale
characteristics and measurement development issues have been largely resolved in the literature.
Concerning specification of performance standards, we are confident (based on past experience)
that, with our guidance, SMEs will be able to develop valid standards. Concerning administration
procedures and rater training, there are many unanswered questions; therefore, we will
concentrate much of the early work in these areas. Although the studies will not be detailed in
this series of papers, the basic mechanism we plan to use may be of interest. Most similar
studies are plagued with using estimates of true performance as criterion measures. We plan to
write scripts and videotape airmen performing actual tasks and then use the videotapes as the
medium for presenting the rating situations during our research. This way we will know the true
performance score, and the tapes will have both face and content validity for the experiments to
be performed.
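Because the videotaped performances are scripted, each rater's scores can be compared directly with a known true score. The sketch below shows one simple way such a comparison might be computed. The accuracy index (mean absolute error), the 1-to-5 scale, and all scores are assumptions made for illustration; the planned studies are not described at this level of detail.

```python
def mean_absolute_error(ratings, true_scores):
    """Average absolute gap between a rater's scores and the scripted true scores."""
    if len(ratings) != len(true_scores) or not ratings:
        raise ValueError("ratings and true_scores must be non-empty and equal in length")
    return sum(abs(r - t) for r, t in zip(ratings, true_scores)) / len(ratings)


# Hypothetical scripted performance levels for four videotaped tasks (1-5 scale).
true_scores = [4, 2, 5, 3]

# Hypothetical ratings from two raters who viewed the same four tapes.
rater_a = [4, 3, 5, 3]
rater_b = [5, 4, 5, 1]

print(mean_absolute_error(rater_a, true_scores))  # 0.25
print(mean_absolute_error(rater_b, true_scores))  # 1.25
```

A lower value indicates ratings closer to the scripted performance, so the index could be used to compare raters before and after different forms of rater training.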
THE METHODOLOGY OF WALK-THROUGH PERFORMANCE TESTING


Jerry  W.  Hedge 

Air Force Human Resources Laboratory

The purpose of this paper is to describe the development of a new assessment methodology
known as WTPT. In discussing this approach, four main objectives of the Air Force Human
Resources Laboratory's (AFHRL) job performance measurement R&D effort will be emphasized. These
four  objectives  are:  (a)  to  develop  a  measurement  methodology  that  allows  accurate  evaluation  of 
job  proficiency,  (b)  to  develop  a  measurement  technique  that  expands  coverage  of  the  job  content 
domain,  (c)  to  evaluate  the  comparability  of  information  gathered  through  the  two  components  of 
WTPT  (hands-on  and  interview  testing),  and  (d)  to  adapt  the  WTPT  methodology  to  a  variety  of  Air 
Force  specialties.  The  rationale  and  approach  associated  with  each  of  these  objectives  will  be 
detailed  in  the  following  sections  of  this  paper. 


Background 

In  assessing  on-the-job  performance,  a  variety  of  measures  are  available  from  which  to 
choose.  They  range  from  subjective  to  objective,  and  from  general  to  specific.  When  faced  with 
the  choice  of  which  criterion  to  select,  the  researcher  or  practitioner  typically  relies  on 
several  informal  decision  rules: 

1.  Cost,  in  terms  of  time,  money,  safety,  or  mission  effects. 

2.  Convenience  in  developing  or  obtaining  measures. 

3.  Fidelity or accuracy of replicating behaviors relevant to the job.

The  development  of  a  criterion  measure  is  frequently  seen  as  a  secondary  concern  to  the  main 
research focus (e.g., a training program or a selection system); as a result, Decision Rule 2
(convenience) is frequently applied. The outcome is thus a generic, packaged-to-please rating
form  that  in  all  probability  will  also  satisfy  Decision  Rule  1,  but  not  necessarily  Decision 
Rule  3. 

When the chief concern of researchers and practitioners shifts to Decision Rule 3, and
fidelity  becomes  the  overriding  concern,  the  work  sample  orientation  presents  a  viable 
alternative  to  the  convenient  but  subjective  rating  form.  As  noted  by  Wilson  (1962),  over  the 
years  the  primary  use  of  the  work  sample  has  been  for  personnel  selection;  however,  this 
orientation  can  also  be  a  valuable  aid  in  the  measurement  of  job  proficiency.  Typically,  work 
sample  tests  involve  an  individual's  performing  a  task  or  set  of  tasks  that  are  relevant  to  that 
person's  job  and  are  selected  from  the  range  of  tasks  performed  by  that  person. 

The value of the work sample methodology lies in the fidelity with which the selected set of
tasks allows measurement of an incumbent's job proficiency. Unfortunately, the fact that tasks
must be "selected" reflects the technique's chief weakness. Work sample procedures normally
identify critical tasks, discard those not practically measurable, and let the remainder become
the  "selected  set"  of  measured  tasks.  The  Air  Force's  approach  to  work  sample  testing  represents 
an  attempt  to  overcome  this  criterion  deficiency  problem. 


Walk-Through Performance Testing

For the Air Force, hands-on testing presents a particular problem because of the complexity
and expense involved in performing many tasks. For example, many critical tasks cannot be
measured by hands-on testing because these tasks tend to take too long to complete, require
replacement of expensive parts, and risk possible damage to components. AFHRL has developed a
new methodology to deal with these problems. This new approach, WTPT, has as its foundation the
work sample philosophy, but attempts to expand the measurement of critical tasks to include those
tasks not measured by hands-on testing through the use of an interview testing component (Gould &
Hedge, 1983).

WTPT is a task-level job performance measurement system that expands the range of job tasks
on which an individual is measured, by combining hands-on task performance and interview
procedures to provide a high-fidelity measure of individual technical job competence. The
interview  testing  component  has  been  added  as  a  means  of  assessing  those  critical  tasks 
previously  eliminated  from  the  content  domain  because  of  measurement  constraints. 


Overview  of  the  Developmental  Process 

Details of the "Task Selection Plan" used to define the job content domain are presented in a
separate  paper  by  Lipscomb,  which  follows.  At  this  time,  however,  two  aspects  of  this  selection 
process  should  be  noted. 

First,  using  this  sampling  strategy,  tasks  are  classified  into  three  major  phases,  based  on 
measurement  specificity.  This  allows  work  samples  to  be  developed  for  tasks  that  are  common  to 
all  jobs  in  the  specialty  (Phase  I)  or  unique  to  a  particular  duty  area  or  job  type  (Phases  II  & 
III).  Secondly,  no  differentiation  is  yet  made  between  tasks  to  be  measured  by  hands-on  versus 
interview testing. However, as part of this process, information is being collected concerning
length  of  time  required  to  perform  each  task  and  whether  the  task  is  measurable  by  hands-on 
testing.  Once  the  job  content  universe  has  been  reduced,  and  field  interviews  with  SMEs 
initiated,  those  decisions  will  be  made.  Information  about  this  process  can  be  found  in  the 
Bal lentine  and  Lipscomb  paper  contained  in  this  volume. 


WTPT  Components 

As noted previously, WTPT consists of two main components, hands-on testing and interview
testing. The hands-on component resembles a traditional hands-on work sample test designed to
measure proficiency on a critical task that has survived the imposition of time and/or
measurement constraints. For example, the hands-on task outlined in Table 2 requires the
incumbent to install a starter on the jet engine. On the first page of the task item,
information is provided to the test administrator concerning testing time; required tools,
technical orders, and job guides; pertinent background information and necessary engine
configuration; and administrator's testing instructions. While the starter is being installed,
the test administrator uses the checklist to indicate whether steps (e.g., lubricate the spline,
index position of the starter, and install the locking device) are performed correctly or not.
Finally, a 5-point rating scale is provided so an overall rating of proficiency on that task can
also be recorded.

Interview testing allows the administrator to measure proficiency on tasks precluded from
hands-on measurement (i.e., tasks that are either too time-consuming, too costly, or too
dangerous for hands-on measurement). Interview testing requires the administrator to assess an
incumbent's proficiency on a task by asking questions designed to uncover proficiency-based
strengths and weaknesses related to the performance of that task. For example, the interview
test item in Table 3 evaluates the jet engine mechanic's ability to determine the source of high
oil consumption. Using categories similar to those for hands-on items, the test administrator
asks the incumbent to show/explain procedures and provide answers to a variety of questions. In
addition, a 5-point overall proficiency scale is completed by the administrator.
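
As an illustration only, the information captured for one work sample (a checklist of yes/no steps plus a 5-point overall rating) could be represented as follows. The class name, field names, and the proportion-correct summary are assumptions made for this sketch; the text does not say how checklist results are combined into a score.

```python
from dataclasses import dataclass, field


@dataclass
class WorkSampleRecord:
    """One administered WTPT item (hypothetical structure, not from the source)."""
    task_id: str
    mode: str                      # "hands-on" or "interview"
    overall_rating: int            # 5-point overall proficiency rating, 1 to 5
    steps: dict = field(default_factory=dict)  # step description -> performed correctly?

    def __post_init__(self):
        if self.mode not in ("hands-on", "interview"):
            raise ValueError("mode must be 'hands-on' or 'interview'")
        if not 1 <= self.overall_rating <= 5:
            raise ValueError("overall_rating must be between 1 and 5")

    def proportion_correct(self):
        """Share of checklist steps marked 'Yes' (an assumed summary statistic)."""
        if not self.steps:
            return 0.0
        return sum(self.steps.values()) / len(self.steps)


# Step names are taken from the starter-installation example in the text;
# the yes/no outcomes and the rating are made up for illustration.
record = WorkSampleRecord(
    task_id="starter-installation",
    mode="hands-on",
    overall_rating=3,
    steps={
        "lubricate the spline": True,
        "index position of the starter": True,
        "install the locking device": False,
    },
)
```

Keeping the checklist results and the overall rating in one record makes it straightforward to compare a hands-on item with an interview item that covers the same task.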
Table 2. Hands-On Task Item

Phase I  J-79, J-57, TF-33                                  Hands-On Task 3^7
Shop and Flightline

Objective: To evaluate the incumbent's ability to install starters.

Estimated Time: 25 M     Start: ____   Finish: ____   Time Req: ____
Time Limit: 15 M         # Times Performed: ____   Last Performed: ____

Tools and Equipment: Consolidated Tool Kit, 0- to [illegible]-inch-pound Torque Wrench,
                     10- to 300-inch-pound Torque Wrench, Lubricant.

Appropriate T.O.:
   J-79 (Fighter):        1F-4E-10
   J-57 (Tanker):         1C-135(K)A-2-4JG-6
   TF-33 (P7) (Cargo):    1C-141A-2-4JG-5 or 1C-141B-10
   General Torquing:      2-1-111 or 1-1A-8 or specific engine torquing T.O.:
                             J-79:   2J-J79-86-7WP00100
                             J-57:   1C-135(K)A-2-4JG-1
                             TF-33:  1C-141B-10

Background Information: There are some common steps for all three engines, but each
   engine has some unique steps. The evaluation will be made on the common steps
   except when indicated. Differences include:

   1. J-57 has two cannon plugs.
      J-79 and TF-33 (P7) have one cannon plug.
   2. J-57 and TF-33 have one nut on the V-clamp.
      J-79 has two nuts on the V-clamp.

   Two-person task when actually putting the starter in place. This is the only task
   for which the incumbent will be required to actually get the technical order from
   the shelf.

Engine Configuration: The starter adapter pad must be on the engine. The starter is
   off the engine.

Instructions:
   Administer in the shop.
   The incumbent MUST use the T.O.
   Compare the incumbent's response to the correct answer for the appropriate engine.

SAY TO THE INCUMBENT

GET THE T.O. USED TO INSTALL A STARTER AND THE T.O. FOR GENERAL
TORQUING PROCEDURES, THEN INSTALL THE STARTER USING THE
APPROPRIATE PROCEDURES FROM BOTH T.O.s. FOLLOW GENERAL
MAINTENANCE PROCEDURES AT ALL TIMES. TELL ME IF YOU PLAN TO
DEVIATE FROM THE T.O. YOU MAY NOT ASK ANYONE TO HELP YOU FIND
THE CORRECT T.O.

Performed or Answered Correctly                                  Yes   No

Did the incumbent:

 1. Obtain the appropriate T.O. for the starter installation
    and the torquing procedures within 10 minutes?               ___   ___

 2. Hang the clamp per the specific T.O.?                        ___   ___

 3. Lubricate the spline?                                        ___   ___

 4. Ensure that the starter was not left in an unsupported
    position (hung by the shaft) at any time?                    ___   ___

 5. Index (position) the starter per the appropriate T.O.?       ___   ___
       J-79:   Breech at 8 o'clock position
       J-57:   Breech at 3 o'clock position
       TF-33:  Drain plug at 6 o'clock position

 6. Properly seat the V-Band Clamp?                              ___   ___

 7. Torque the V-Band Clamp per the appropriate T.O.?            ___   ___
       J-79 AiResearch:  110 to 130 inch-pounds
       J-79 Sundstrand:  65 inch-pounds
       J-57:             65 to 70 inch-pounds
       TF-33:            60 to 70 inch-pounds

 8. Install the locking device on the V-Band Clamp per the
    appropriate T.O.?                                            ___   ___

 9. Connect the applicable electrical connector (cannon plug)?
    (Must not connect the tachometer generator plug on the
    J-57)                                                        ___   ___

10. Use the correct tools and materials?                         ___   ___

STOP TIME: ____

OVERALL PERFORMANCE

5  Far exceeded the acceptable level of proficiency
4  Somewhat exceeded the acceptable level of proficiency
3  Met the acceptable level of proficiency
2  Somewhat below the acceptable level of proficiency
1  Far below the acceptable level of proficiency
Table 3. Interview Task

Phase III  TF-33 P7                                         Interview Task 325
Flightline

Objective: To evaluate the incumbent's knowledge concerning the determination of
   high oil consumption on TF-33 engines.

Start: ____   Finish: ____   Time Req: ____
Time Limit:          # Times Performed: ____   Last Performed: ____

Tools and Equipment: None.   T.O. 1C-141B-2-4TS-1, page 6-7.
Background Information: N/A
Engine Configuration: N/A

Instructions:
   Administer in the shop in a quiet place.
   The incumbent may use the T.O. except when indicating the oil flow path.

SAY TO THE INCUMBENT

I AM GOING TO ASK YOU SOME QUESTIONS ABOUT TF-33 ENGINE OIL
CONSUMPTION. YOU MAY USE THE T.O. AS A REFERENCE WHEN
ANSWERING THESE QUESTIONS EXCEPT FOR THE FIRST QUESTION WHICH
DEALS WITH THE OIL FLOW PATH.

Performed or Answered Correctly                                  Yes   No

 1. Beginning and ending at the oil tank, tell me the path
    that the oil flows through the following components:
    oil tank, oil bypass valve, oil filter, scavenge pumps,
    oil jets for bearing cavities and sumps, oil pressure
    relief valve, air oil cooler, fuel oil cooler, oil pump.

    Remember, you may NOT use the T.O. while answering this
    question.

    ANSWER:                              Incumbent's order 1-10
       a. Oil Tank                                ___
       b. Oil Pump                                ___
       c. Oil Pressure Relief Valve               ___
       d. Oil Filter                              ___
       e. Oil Bypass Valve                        ___
       f. Oil Jets for Bearing Cavities
          and Sumps                               ___
       g. Scavenge Pumps                          ___
       h. Air Oil Cooler                          ___
       i. Fuel Oil Cooler                         ___
       j. Oil Tank                                ___

SAY TO THE INCUMBENT

NOW YOU MAY USE THE T.O. IF YOU WISH

 2. Name four areas other than the oil cooler and the engine
    cowling that you might check for external oil leaks or
    internal oil consumption.

    ANSWER: (The incumbent must mention at least four of the
    following for credit.)                                       ___   ___
       a. Oil tank                                ___
       b. Gear box                                ___
       c. Garlock seal leaks                      ___
       d. Oil pump accessory housing              ___
       e. Pressure lines                          ___
       f. Scavenge lines                          ___
       g. Engine inlet                            ___
       h. Engine exhaust                          ___
       i. Combustion case split line              ___

 3. Why is the engine cowling normally the first area to be
    inspected when determining the source of high oil
    consumption?

    ANSWER: Oil in the cowling would indicate an external
    leak.                                                        ___   ___

 4. What is the purpose of performing a breather isolation
    check?

    ANSWER: To determine the location of the internal oil
    leak.                                                        ___   ___

 5. Other than checking the servicing level yourself or
    asking the crew chief, what other source is available
    for determining when the oil system was last serviced?

    ANSWER: Aircraft Forms                                       ___   ___

 6. What readings would indicate a restriction in the
    scavenge system?

    ANSWER: A normal oil breather pressure reading and a
    high oil scavenge pressure reading.                          ___   ___

 7. What two basic pieces of information would you need to
    determine whether or not you had an excessive oil
    consumption condition?

    ANSWER: (Incumbent must answer both for credit.)             ___   ___
       a. The number of flying hours              ___
       b. The number of quarts of oil serviced    ___

 8. What creates the oil flow from the supply tank to the
    engine pressure pump?

    ANSWER: Gravity                                              ___   ___

 9. What component regulates the oil pressure after it
    leaves the oil pump?

    ANSWER: The pressure relief valve.                           ___   ___

STOP TIME: ____

NOTE: TURN PAGE FOR RATING SCALE

OVERALL PERFORMANCE

5  Far exceeded the acceptable level of proficiency
4  Somewhat exceeded the acceptable level of proficiency
3  Met the acceptable level of proficiency
2  Somewhat below the acceptable level of proficiency
1  Far below the acceptable level of proficiency
The interview testing is conducted at the worksite in a "show-and-tell" fashion, such that
the person being evaluated can "visually and verbally" describe how a step is to be accomplished
(e.g., "that bolt is to be turned five revolutions," or "that component is to be lubricated prior
to  being  assembled").  Thus,  additional  information,  not  otherwise  collected,  can  be  assembled 
along  with  hands-on  information  to  provide  a  more  thorough  coverage  of  the  content  domain,  and 
hopefully,  a  more  accurate  picture  of  an  individual's  job  proficiency. 


WTPT  Design  and  Procedural  Approach 

At  the  beginning  of  this  paper,  it  was  stated  that  two  of  our  objectives  were  to  expand 
coverage  of  the  job  content  domain  and  to  determine  if  measurement  comparability  could  be 
established  between  hands-on  and  interview  testing.  A  short  description  of  the  procedure  and 
design  for  the  WTPT  testing  will  clarify  how  these  objectives  are  met.  First  (for  the  Jet  Engine 
Mechanic),  in  approximately  7  hours  of  testing  time,  an  individual's  job  proficiency  is  measured 
on  10  hands-on  and  10  interview  testing  work  samples.  Five  of  the  interview  items  are  designed 
to cover unique aspects of the job not already measured through hands-on testing; the other five
interview  items  measure  proficiency  on  tasks  already  covered  by  existing  hands-on  items.  These 
20  work  samples,  then,  make  up  the  Jet  Engine  Mechanic  WTPT.  For  testing  purposes,  the  20  items 
are  intermixed  and  presented  to  job  incumbents  at  the  worksite.  As  noted  previously,  these  tasks 
were  selected  according  to  a  three-phased  strategy.  The  breakdown  of  these  20  tasks  across 
phases  for  the  Jet  Engine  Mechanic  is  shown  in  Table  4. 


Table 4. Breakdown of Work Sample Tests by
Phases for the Jet Engine Mechanic Specialty

                 Unique Items         Overlap Items          Total
             Hands-on   Interview       Interview     Hands-on   Interview

Phase I         5           0               3             5          3
Phase II        3           2               2             3          4
Phase III       2           3               0             2          3

Total                                                    10         10
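
The counts reported in Table 4 can be cross-checked against the design described in the text (10 hands-on items, 5 unique interview items, and 5 overlap interview items). The sketch below simply tallies the table; the dictionary layout and key names are assumptions chosen for illustration.

```python
# Counts transcribed from Table 4; the key names are illustrative only.
table4 = {
    "Phase I":   {"unique_hands_on": 5, "unique_interview": 0, "overlap_interview": 3},
    "Phase II":  {"unique_hands_on": 3, "unique_interview": 2, "overlap_interview": 2},
    "Phase III": {"unique_hands_on": 2, "unique_interview": 3, "overlap_interview": 0},
}


def summarize(table):
    """Total the hands-on and interview work samples across phases."""
    hands_on = sum(row["unique_hands_on"] for row in table.values())
    unique_interview = sum(row["unique_interview"] for row in table.values())
    overlap_interview = sum(row["overlap_interview"] for row in table.values())
    return {
        "hands_on": hands_on,
        "unique_interview": unique_interview,
        "overlap_interview": overlap_interview,
        "interview": unique_interview + overlap_interview,
        "work_samples": hands_on + unique_interview + overlap_interview,
    }


summary = summarize(table4)
```

Under these counts the tally gives 10 hands-on and 10 interview work samples, 20 in all, which matches the description of the Jet Engine Mechanic WTPT.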

Summary and Conclusions


Benefits  and  Drawbacks  of  WTPT 


As mentioned in the introduction, AFHRL has initiated work in this area with four objectives
in mind. We believe the hands-on work sample orientation provides a high-fidelity measurement
methodology that can be enhanced by applying the WTPT methodology. Also, through the use of
interview testing, measurement of the job content domain is expanded. In addition, if
comparability of hands-on and interview testing is established (at least in some instances), a
savings in terms of testing time can be realized (or if desired, further coverage of the content
domain can be included). One major drawback to this approach is the cost associated with
development and testing. In many instances, the benefits gained in fidelity may be offset by the
high costs. However, although developmental costs do not differ for hands-on and interview
testing, testing time can be significantly reduced if comparability can be demonstrated.
Generalizability of the WTPT Methodology


As  noted  earlier,  the  WTPT  methodology  is  targeted  for  a  variety  of  Air  Force  specialties  in 
the  next  5  to  7  years.  Because  this  measurement  methodology  is  designed  to  reflect  the  work 
performed  in  a  specialty,  its  orientation  should  reflect  the  job  content  domain.  For  instance, 
in our first specialty, Jet Engine Mechanic, the job is (to a large extent) knowledge-based; this
orientation  is  mirrored  by  the  WTPT. 

As  the  WTPT  approach  is  applied  to  other  specialties,  the  work  sample  content  must 
necessarily  change  as  well.  For  example,  as  selected  specialties  become  more  abstract  (e.g., 
Administrative  Specialist)  or  managerial  in  nature,  the  WTPT  may  begin  to  resemble  an  assessment 
center  approach.  In  any  event,  as  data  are  collected  and  evaluated,  decisions  will  be  made  about 
the  costs  and  benefits  of  adopting  such  a  strategy  for  measuring  job  proficiency. 




A TASK-LEVEL DOMAIN SAMPLING STRATEGY:

A  CONTENT  VALID  APPROACH 

M.  Suzanne  Lipscomb 
Air  Force  Human  Resources  Laboratory 

In  developing  task-based  job  performance  measures,  it  is  impractical  to  assess  performance  on 
the  universe  of  tasks  within  most  Air  Force  specialties  (AFSs).  No  individual  performs  all  of 
the tasks in any specialty; and in most specialties, no individual performs an "average" job.
Rather,  the  tasks  of  a  specialty  are  distributed  by  management  action  to  individuals  in 
consistent  ways  so  as  to  cluster  into  a  variety  of  types  of  jobs,  based  on  the  co-performance  of 
tasks  and  the  variations  in  mission,  equipment,  or  management  in  any  given  locale.  This  variance 
of  jobs  within  AFSs  is  an  exceedingly  important  phenomenon  since  it  impacts  on  how  the  specialty 
is  organized  in  the  personnel  system,  the  aptitudes  required,  the  training  provided,  and  the  way 
individuals  can  be  utilized  in  the  workplace  (Mitchell  &  Driskill,  1979). 

This  variance  in  Air  Force  jobs  is  of  concern  to  Air  Force  managers  and  is  one  of  the  major 
issues  of  study  in  the  occupational  analysis  program  (Air  Force  Regulation  35-2,  Occupational 
Analysis  Program).  Data  on  most  AFSs  indicate  that  the  classification  structure  of  the  Air  Force 
is  highly  dynamic,  with  frequent  reallocation  of  tasks  among  specialties.  In  addition,  although 
there may be some common tasks performed by a majority of individuals within a specialty, most of
the  tasks  are  performed  only  by  members  of  the  various  job  types  within  the  specialty. 

It  is  necessary  therefore  to  rely  on  samples  of  performance  that  are  both  useful  for 
differentiating  between  good  and  poor  performers  and  representative  of  the  performance  domain. 
Differentiating  good  and  poor  performers  can  be  accomplished  by  assessing  job  incumbent 
performance  on  tasks  with  a  range  of  difficulty.  In  addition  to  identifying  tasks  with  a  range 
of  difficulty,  selecting  tasks  that  adequately  represent  the  total  specialty  domain  is  necessary 
to  make  inferences  about  performance  from  the  sample  of  specific  task  measures  used.  If  the 
specialty  domain  is  adequately  represented  by  the  tasks  selected,  the  task-based  measurement 
system  can  be  considered  content  valid. 

Unlike  other  types  of  validity,  the  content  validity  of  a  measurement  procedure  is  not  a 
correlational  process  but  an  evaluation  of  adequacy  and  representativeness  using  rational 
judgments.  Lennon  (1956)  stated  that  three  assumptions  underlie  the  use  of  content  validity: 
(a)  the  area  of  concern  to  the  user  can  be  conceived  as  a  meaningful,  definable  universe  of 
responses; (b) a sample can be drawn from the universe in some purposeful, meaningful fashion;
and  (c)  the  sample  and  the  sampling  process  can  be  defined  with  sufficient  precision  to  enable 
the  user  to  judge  how  adequately  the  sample  of  performance  typifies  the  universe  of  performance. 

Given the information available in the Air Force Occupational Research Data Base, these three
assumptions can be met; thus, the issue of content validity can be addressed. The universe of
responses can be defined as the universe of tasks for an AFS, as detailed by the occupational
survey report task list. The sample can be drawn in a meaningful fashion based on the task-level
occupational survey data available. Finally, the sampling process can be defined with precision
using a task sampling plan which will allow a judgment to be made as to the adequacy of the
sample.

A task sampling plan consisting of a procedural set of guidelines must be developed: (a) to
specify the job and task domains of interest, (b) to establish the level of measurement
specificity, and (c) to determine the proportional weighting (importance) of the work activities
identified. Such guidelines will assure objectivity, replicability, and comparability of efforts
to develop measures which detect meaningful differences in performance. These guidelines are
presented as they apply to enlisted AFSs. Also provided is an illustration of their use for
selecting tasks within the first AFS to be investigated for the Joint-Service Job Performance
Measurement Project, Jet Engine Mechanic (AFS 426X2).


Task  Selection  Procedural  Guidelines 


Defining  the  Job  Domain 

For  most  AFSs,  the  Air  Force  has  a  wealth  of  information  sources,  which  give  a  comprehensive 
picture  of  the  work  domain.  These  sources  give  the  AFS  entrance  requirements  and  a  general 
specialty  description  (AFR  39-1,  Airman  Classification  Regulation);  AFS  training  requirements 
(AFR  50-5,  USAF  Formal  Schools  and  Specialty  Training  Standards);  and  occupational  survey  data. 

Occupational  survey  data  include  the  percentage  of  incumbents  performing  specific  tasks  and 
the  relative  time  they  spend  on  each,  as  well  as  SMEs'  judgments  of  the  relative  time  required  to 
learn to perform tasks (i.e., task difficulty) and the relative importance of training for each
task (i.e., recommended training emphasis). Occupational survey and training information cover
the full scope of tasks performed by incumbents in an AFS, and therefore can be applied to the
development  of  the  task-based  performance  measures. 

Because  occupational  survey  data  provide  the  most  detailed  and  comprehensive  source,  they 
will  be  used  to  define  the  work  domain.  The  other  sources  will  provide  complementary  information. 

The  goals  of  the  Air  Force  job  performance  measurement  program  are  to  assess  specific  job 
competencies  required  within  a  specialty  and  general  competencies  applicable  across  AFSs.  These 
two  types  of  measures  require  four  levels  of  measurement  specificity:  Air  Force-wide, 
specialty-wide,  duty-core,  and  incumbent-unique  measures.  Because  the  focus  of  this  paper  is  on 
selecting  tasks  required  to  measure  individuals'  competence  within  an  AFS,  the  latter  three 
levels  of  measurement  specificity  will  be  highlighted. 

To  include  an  adequate  representation  for  each  of  these  three  levels  of  measurement 
specificity,  tasks  within  an  AFS  must  be  categorized  accordingly.  That  is,  tasks  can  be 
categorized  into  those  performed  throughout  the  specialty  (i.e.,  specialty-wide),  those  specific 
to certain duties within an AFS (i.e., duty-core), and those uniquely performed by incumbents in
certain  job  types  (i.e.,  incumbent-unique). 

The occupational survey task inventory will be used to define the work domain and categorize
tasks. As task performance is often specific to equipment or work centers, tasks associated with
equipment or work centers will be used to identify the duty-core domain. Finally, tasks
associated with specific job types defined by the occupational analysis will delineate the
incumbent-unique domain. Because it would be impractical to cover adequately all duty areas and
job types within heterogeneous AFSs, those most representative of the work performed will be
selected. That is, duty areas and job types which involve the largest percentage of personnel
will be chosen.


Selecting  Tasks  Representative  of  the  Job  Domain 

The procedures for sampling tasks representative of the three task domains are outlined in
the following paragraphs, along with the rationale for these procedures. For each task domain,
the number of tasks selected should be based on a judgment of the number of performance measures
required to give an adequate sample, while conforming to a total testing time of no more than 8
hours for all measures. This time limit is considered the maximum time feasible to keep an
airman away from his/her unit. Within this timeframe, individuals will be assessed on
specialty-wide, duty-core, and incumbent-unique tasks.


Phase  I.  Selection  of  Specialty-Wide  Tasks 

Step 1. Select all tasks that are included in the Plan of Instruction (POI) for initial AFS
training or, if not in the POI, are performed by at least 30% of the first-term incumbents (those
with 1 through 48 months of total active Federal military service).

This will reduce the task pool to those tasks deemed important enough to train or those which
are performed by a substantial number of first-term airmen across the AFS. (The 30% cutoff value
may be varied by specialty according to the number of tasks performed by first-termers in that
specialty.)
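
A minimal sketch of the Step 1 screen follows, assuming each task record carries a POI flag and a percent-performing value. The `Task` class, its field names, and the example data are hypothetical; only the POI rule and the 30% default cutoff come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    task_id: str            # hypothetical identifier
    in_poi: bool            # True if the task appears in the Plan of Instruction
    pct_first_term: float   # percent of first-term incumbents performing the task


def specialty_wide_pool(tasks, cutoff_pct=30.0):
    """Keep tasks that are in the POI or that meet the percent-performing cutoff."""
    return [t for t in tasks if t.in_poi or t.pct_first_term >= cutoff_pct]


# Made-up illustrative data, not taken from any occupational survey.
example_tasks = [
    Task("A", in_poi=True, pct_first_term=5.0),
    Task("B", in_poi=False, pct_first_term=42.0),
    Task("C", in_poi=False, pct_first_term=12.0),
]
pool_ids = [t.task_id for t in specialty_wide_pool(example_tasks)]
```

Passing a different `cutoff_pct` reflects the note that the cutoff may be varied by specialty.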

Step  2.  Cluster  tasks  selected  in  Step  1  based  on  one  of  the  following:  (a)  factor  analysis 
based  on  tasks  performed  together,  (b)  Specialty  Knowledge  Test  outline,  (c)  Specialty  Training 
Standard  outline,  or  (d)  occupational  survey  inventory  duty  outline.  Each  of  these  is  a  means  of 
organizing  the  pool  of  tasks  into  performance/knowledge  areas  based  on  occupational  information. 
All  will  produce  similar  results;  thus,  the  selection  of  the  grouping  strategy  should  be  based  on 
a  judgment  as  to  which  is  cost  effective  and  best  suited  to  the  development  of  performance 
measures  for  a  specific  AFS. 

Step 3. Weight each task cluster to reflect its relative importance to the overall
performance of first-term airmen within the specialty. Possible indices for weighting clusters
include the following: (a) Specialty Knowledge Test outline weights, (b) Specialty Training
Standard proficiency-level requirements, (c) SME judgments of relative importance, or (d) weights
derived from existing task factor data. Relevant task factor information includes recommended
training emphasis ratings (i.e., SME judgments of the extent to which training is required for
tasks) and percent time spent values (i.e., incumbent ratings of the relative time spent
performing tasks). These task factor data can be used to derive weights by generating the
product of the mean recommended training emphasis rating and the cumulative percent time spent
performing tasks in a cluster.
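
The task-factor weighting described in Step 3, option (d), can be sketched as follows. The function name and the example values are assumptions for illustration; the calculation itself (mean training emphasis multiplied by cumulative percent time spent) follows the text.

```python
from statistics import mean


def cluster_weight(training_emphasis, pct_time_spent):
    """Weight for one cluster: mean recommended training emphasis rating
    times the cumulative percent time spent on the cluster's tasks."""
    if len(training_emphasis) != len(pct_time_spent):
        raise ValueError("one rating and one time value are needed per task")
    return mean(training_emphasis) * sum(pct_time_spent)


# Made-up values for a two-task cluster.
example_weight = cluster_weight([4.0, 6.0], [2.5, 1.5])
```

With the made-up values above, the mean rating is 5.0 and the cumulative percent time spent is 4.0, so the cluster weight is 20.0.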

Step 4. Determine the number of tasks to be selected from each cluster to reflect the
assigned weights as follows. Total the cluster weights. Divide each cluster weight by the total
to get a percentage. Multiply each cluster percentage by the total possible tasks to find the
number of tasks to be selected for each cluster.
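
The Step 4 arithmetic can be sketched as below. The text does not say how fractional results are rounded, so the largest-remainder rule used here is an assumption, as are the function name and the example weights.

```python
import math


def allocate_tasks(cluster_weights, total_tasks):
    """Split total_tasks across clusters in proportion to their weights.

    Fractional quotas are resolved with a largest-remainder rule so the
    counts always sum to total_tasks (an assumed rounding convention).
    """
    total_weight = sum(cluster_weights.values())
    quotas = {name: w * total_tasks / total_weight for name, w in cluster_weights.items()}
    counts = {name: math.floor(q) for name, q in quotas.items()}
    leftover = total_tasks - sum(counts.values())
    # Hand the remaining tasks to the clusters with the largest fractional parts.
    by_remainder = sorted(quotas, key=lambda name: quotas[name] - counts[name], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


# Made-up cluster weights for illustration.
even_split = allocate_tasks({"A": 50, "B": 30, "C": 20}, 10)
uneven_split = allocate_tasks({"A": 3, "B": 2, "C": 1}, 10)
```

In the second example the raw quotas are 5.00, 3.33, and 1.67, so the one task left over after rounding down goes to cluster C.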

Step 5. Within each cluster, select the number of tasks determined in Step 4 to reflect a
range of learning/task difficulty by: (a) ranking the tasks from low to high task difficulty,
(b) dividing the ranked list into quartiles, (c) selecting 40% of the tasks from the fourth
quartile, (d) selecting 30% from the third quartile, (e) selecting 20% from the second quartile,
(f) selecting 10% from the first quartile, and (g) repeating for each cluster. (It is important to
sample tasks with a range of difficulty so incumbent performance assessment will reflect the
rank-ordering of people of varying levels of job competence. The sampling is more heavily
weighted on the more difficult tasks because they determine the aptitude requirements of the
specialty and are also where most performance variation should occur.)
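
The quartile-based selection in Step 5 can be sketched as follows. Several details are assumptions made for this illustration: tasks are (identifier, difficulty) pairs, quartile boundaries use integer division, fractional counts are resolved by a largest-remainder rule, and the first tasks in each quartile are taken. The text specifies only the 10/20/30/40 split, not how tasks are picked within a quartile.

```python
QUARTILE_SHARES = (10, 20, 30, 40)  # percent drawn from quartiles 1 through 4


def split_into_quartiles(ranked):
    """Cut a difficulty-ranked list into four consecutive groups."""
    n = len(ranked)
    bounds = [(n * k) // 4 for k in range(5)]
    return [ranked[bounds[k]:bounds[k + 1]] for k in range(4)]


def quartile_counts(n_select):
    """Number of tasks to draw from each quartile (easiest first)."""
    counts = [n_select * share // 100 for share in QUARTILE_SHARES]
    remainders = [n_select * share % 100 for share in QUARTILE_SHARES]
    leftover = n_select - sum(counts)
    # Ties in the remainder go to the more difficult quartile.
    order = sorted(range(4), key=lambda i: (remainders[i], i), reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts


def select_by_difficulty(tasks, n_select):
    """Pick n_select tasks, weighted toward the harder quartiles."""
    ranked = sorted(tasks, key=lambda task: task[1])
    chosen = []
    for members, k in zip(split_into_quartiles(ranked), quartile_counts(n_select)):
        chosen.extend(members[:k])  # assumed rule: take the first k in each quartile
    return chosen


# Twenty made-up tasks with difficulty ratings 1.0 through 20.0.
example_tasks = [(f"T{i:02d}", float(i)) for i in range(1, 21)]
chosen_ids = [task_id for task_id, _ in select_by_difficulty(example_tasks, 10)]
```

If a quartile holds fewer tasks than its share calls for, this sketch returns fewer tasks than requested; a fuller version would need a rule for redistributing the shortfall.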


Step 6. Review the tasks identified in Step 5 to determine if they can be measured by either
the hands-on or interview component of WTPT. Reject any task found to be unsuitable for WTPT,
and document the reason it was judged unsuitable. If possible, select a replacement task from
the same task difficulty quartile. (The ability to assess performance through
observation/interview procedures [WTPT] is a prerequisite to final task selection because
performance measures obtained via these high-fidelity techniques will be the benchmarks against
which surrogate measures are compared.)
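
The Step 6 review can be sketched as a pass over the selected tasks that logs each rejection and, where a spare exists, substitutes a task from the same difficulty quartile. The function, its arguments, and the example data are hypothetical, and the spares are assumed to have already been judged suitable for WTPT.

```python
def review_for_wtpt(selected, unsuitable, spares_by_quartile):
    """Replace unsuitable tasks with spares from the same difficulty quartile.

    selected:           list of (task_id, quartile) pairs from Step 5
    unsuitable:         dict mapping task_id to the documented reason
    spares_by_quartile: dict mapping quartile to a list of spare task_ids
    """
    spares = {q: list(ids) for q, ids in spares_by_quartile.items()}  # work on a copy
    final, rejection_log = [], []
    for task_id, quartile in selected:
        if task_id not in unsuitable:
            final.append((task_id, quartile))
            continue
        rejection_log.append((task_id, unsuitable[task_id]))
        if spares.get(quartile):
            final.append((spares[quartile].pop(0), quartile))
    return final, rejection_log


# Made-up example: T16 is rejected and replaced by T20 from the same quartile.
final_tasks, log = review_for_wtpt(
    selected=[("T07", 2), ("T16", 4), ("T17", 4)],
    unsuitable={"T16": "hypothetical reason recorded by the reviewer"},
    spares_by_quartile={4: ["T20"]},
)
```

When no spare is available in the matching quartile, the rejected task is simply dropped, which mirrors the "if possible" wording of the step.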

Phase  II.  Selection  of  Duty-Core  Tasks 

Because  the  performance  domain  for  a  duty  area  (e.g.,  a  specific  engine  type  or  work  center) 
is  less  broad  than  for  the  entire  specialty,  fewer  tasks  are  needed  for  an  adequate  sample. 
Also, because tasks selected for one duty area may be performed in another, tasks can be selected
for more than one duty area. However, since tasks selected in Phase I for specialty-wide
measures will be used to assess all incumbents, they should not be used to develop duty-core
measures. The following steps apply for each duty area.

Step 1. Select from among those tasks not utilized in Phase I all tasks performed by at least
40% (as noted earlier, this cutoff may vary according to the number of tasks performed by
first-termers) of the first-term airmen identified as performing the duty in question. (Within
each duty area, a higher proportion of incumbents performing tasks can be used as the basis for
identifying tasks to be assessed because the performance domain is more narrowly defined than
across the entire specialty.)

Step 2. From the tasks identified in Step 1, select tasks which reflect a range of
learning/task difficulty by repeating Phase I, Step 5.

Step  3.  Repeat  Phase  I,  Step  6. 


Phase III. Selection of Incumbent-Unique Tasks

Because the performance domain for each job type is much less broad than for the entire
specialty, fewer tasks are needed to provide an adequate sample. Also, the tasks selected for a
job type may be applicable to more than one job type; however, tasks selected in Phases I or II
should not be used to develop incumbent-unique measures. The following steps apply for each job
type.


Step 1. Select all tasks performed by 50% or more of the incumbents in the incumbent-unique
group and not utilized in Phases I or II. (Again, as the job domain becomes more specific, it is
possible to select tasks performed by a higher proportion of incumbents. In addition, the cutoff
may vary by number of tasks performed by first-termers.)


Step 2. From the tasks identified in Step 1, select tasks which reflect a range of
learning/task  difficulty  by  repeating  Phase  I,  Step  5. 


Step  3.  Repeat  Phase  I,  Step  6. 
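
The Step 1 screens for Phases II and III share one pattern: apply a percent-performing cutoff (40% for a duty area, 50% for a job type) and exclude any task already used in an earlier phase. A minimal sketch follows, with hypothetical task identifiers and percentages.

```python
def phase_pool(pct_performing, cutoff_pct, already_used):
    """Tasks meeting the cutoff that were not used in an earlier phase.

    pct_performing: dict mapping task_id to percent of the relevant
                    first-term group performing the task
    already_used:   set of task_ids selected in earlier phases
    """
    return sorted(
        task_id
        for task_id, pct in pct_performing.items()
        if pct >= cutoff_pct and task_id not in already_used
    )


# Made-up data for one duty area and one job type.
specialty_wide = {"T1", "T2"}                       # Phase I selections
duty_area_pct = {"T1": 90.0, "T3": 45.0, "T4": 20.0}
job_type_pct = {"T3": 80.0, "T4": 55.0, "T5": 50.0, "T6": 10.0}

duty_core = phase_pool(duty_area_pct, 40.0, specialty_wide)
incumbent_unique = phase_pool(job_type_pct, 50.0, specialty_wide | set(duty_core))
```

Here T1 is excluded from the duty-core pool because it was already used in Phase I, and T3 is excluded from the incumbent-unique pool because it was used in Phase II. The resulting pools would then go through the difficulty sampling and WTPT suitability review of Phase I, Steps 5 and 6.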


Review and Approval of Test Sample


Upon  application  of  these  task  sampling  procedures,  the  specialty-wide  tasks  selected  for 
each  AFS  were  reviewed  by  appropriate  AFS  functional  managers  and  technical  training 
representatives,  who  provided  feedback  concerning  the  adequacy  of  the  tasks  selected.  Reviewers 
examined  the  task  sample  to  ensure  that  work  performed  by  first-term  airmen  and  critical  wartime 
requirements  were  well  represented.  Approval  of  the  task  sample  by  these  policy-makers  should 
increase the acceptance and utilization of the resulting job performance data.
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Application  to  the  Jet  Engine  Mechanic  Specialty  (AfS  <26X2) 


Before  the  sampling  plan  could  be  applied  to  the  Jet  Engine  Mechanic  Specialty,  duty  areas 
and  incumbent-unique  job  types  were  identified  using  the  following  procedures. 


Defining  the  Job  Domain 

Duty Areas Selected. Duty areas were selected based on the type of engine maintained. An
inspection of the occupational survey data revealed that 20%, 18%, and 17% of AFS 426X2
first-term airmen performed maintenance tasks on J-57, J-79, and TF-33 engines, respectively.
Because these percentages were the highest among the nine engine types maintained by AFS 426X2
personnel,  these  three  engines  were  selected  as  being  representative  of  equipment  maintained  by 
first-term  jet  engine  mechanics. 

Job  Types  Selected.  The  occupational  survey  data  also  revealed  that  the  vast  majority  of 
first-term jet engine mechanics performed similar jobs (i.e., most airmen maintained similar
engine accessory systems). The largest percentage of first-term incumbents in each major command
(MAJCOM)  spent  the  majority  of  their  time  performing  general  engine  maintenance  tasks  in  shop  or 
on  the  flightline.  As  a  result,  these  two  functional  areas  were  identified  as  being 
representative  of  AFS  426X2. 


Phase I. Selecting Specialty-Wide Tasks

Task  Clustering.  Tasks  were  clustered  by  occupational  survey  duty  area  because  this  grouping 
adequately  reflected  the  work  done  in  the  specialty  and  was  cost  effective.  Weights  were 
computed based on the product of the mean recommended training emphasis rating and the cumulative
percent  time  spent  performing  tasks  in  a  cluster.  The  following  six  task  clusters  received  the 
weights  indicated  below. 


Cluster                                                      Weight

Preparing and Maintaining Forms, Records and Reports           10
Performing Quality Control Functions                            5
Performing Flightline Engine Maintenance Functions             10
Performing In-Shop Engine Maintenance Functions                20
Performing Test Cell Functions                                  5
Performing General Engine Maintenance Functions                50


Task  Selection  and  Review.  The  remaining  Phase  I  steps  were  followed,  and  18  tasks  were 
selected  to  reflect  the  weights  outlined  above.  Ten  tasks  each  were  selected  for  each  engine 
type in Phase II and for each job type in Phase III. Selected tasks were reviewed by SMEs, and
unsuitable  tasks  were  deleted. 
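
The paper does not report how many of the 18 Phase I tasks were drawn from each cluster. One
plausible way to make a task sample "reflect the weights" is a proportional split with
largest-remainder rounding, sketched below in Python. The allocation rule and the function name
are assumptions made for illustration; only the cluster weights and the total of 18 come from
the text.

```python
def allocate_tasks(weights, total_tasks):
    """Split `total_tasks` across clusters in proportion to integer weights,
    using largest-remainder rounding so the counts sum exactly."""
    weight_sum = sum(weights.values())
    # (whole tasks, remainder) for each cluster, kept in integer arithmetic.
    shares = {c: divmod(total_tasks * w, weight_sum) for c, w in weights.items()}
    counts = {c: whole for c, (whole, _) in shares.items()}
    leftover = total_tasks - sum(counts.values())
    # Give the remaining tasks to the clusters with the largest remainders.
    by_remainder = sorted(shares, key=lambda c: shares[c][1], reverse=True)
    for cluster in by_remainder[:leftover]:
        counts[cluster] += 1
    return counts


# Cluster weights as listed in the table above.
CLUSTER_WEIGHTS = {
    "Forms, Records and Reports": 10,
    "Quality Control": 5,
    "Flightline Engine Maintenance": 10,
    "In-Shop Engine Maintenance": 20,
    "Test Cell": 5,
    "General Engine Maintenance": 50,
}

allocation = allocate_tasks(CLUSTER_WEIGHTS, 18)
for cluster, n_tasks in allocation.items():
    print(f"{cluster}: {n_tasks}")
```

Under this assumed rule, half of the 18 tasks fall in the General Engine Maintenance cluster and
every cluster receives at least one task. The actual selection may have differed.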

New  tasks  were  selected  and  reviewed,  and  the  task  list  was  finalized,  giving  a 
representative set of tasks on which to develop performance measures. The main justifications
for  the  task  exclusions  were: 

1. Task not common to all engines (Phase I).

2. Task performed differently on different aircraft (Phase I).

3. Task not common to all functional areas (Phases I and II).

4. Task not representative of functional area (Phase III).

5.  Task  unclear,  too  broad,  complex,  or  trivial. 

6.  Task  overlapping  or  similar. 

7.  Task  performed  differently  depending  on  how  engine  is  shipped  (air,  rail,  or  truck)  and 
its  destination  (depot,  deployment). 

8. Task performed differently depending on organizational unit. (Examples: Some supervisors
do not allow test cell personnel to transport engines. SAC flightline personnel do not make
entries on oil analysis request forms (DD Form 2026), but MAC flightline personnel do make such
entries.)

9. Task involves equipment being changed within the year.

In summary, a strategy for task selection was developed to sample tasks representative of the
job  content.  This  strategy  was  applied  to  the  Jet  Engine  Mechanic  Specialty  (426X2),  and  the 
selected  tasks  were  used  to  develop  the  performance  measures  and  standards  described  in  the 
following paper.


DEVELOPING PERFORMANCE MEASURES AND STANDARDS
FOR  ACCURATE  ASSESSMENT 

Rodger D. Ballentine
and
M. Suzanne Lipscomb
Air Force Human Resources Laboratory

Once  representative  task  statements  from  the  occupational  inventory  have  been  identified  by 
the Task Selection Plan, and refined through SME input, the developmental focus shifts toward
task  analysis  and  item  writing.  This  paper  will  highlight  the  time  period  from  task 
identification  through  construction  of  work  sample  tests  to  measure  performance  on  these  tasks. 
Discussion  will  center  on  the  task  analysis  process,  item  writing,  pilot  testing,  and  final 
selection  of  test  items. 


Task  Analysis  Process 

The  objective  of  the  task  analysis  process  is  to  gather  information  essential  to  WTPT  item 
development.  Once  the  task  statements  have  been  identified  using  the  Task  Selection  Plan 
discussed  by  Lipscomb  in  the  previous  paper,  these  statements  are  taken  to  operational  units  and 
discussed with SMEs (senior noncommissioned officers). Because occupational inventory task
statements  vary  from  rather  general  to  quite  specific,  detailed  discussions  are  required  to 
clarify  task  boundaries  and  the  nature  of  the  work  performed.  Several  interview  workshops  with 
SMEs  from  field  units  will  provide  the  majority  of  the  information  needed  to  construct  items. 

In  these  working  sessions,  technical  orders  and  job  guides  are  used  to  identify  the  steps 
involved in task performance, the correct procedures to be used by job incumbents, and the
sequencing of steps if a rigid ordering is required. An extensive listing of information
gathered during the task analysis process can be found in Table 5, but several issues need to be
highlighted  here. 


Table 5. Information Gathered During the Task Analysis Process


1. Task/Step Information

   - Consequences of incorrect performance
   - Identification of critical steps (criticality criteria)
       -- Safety of personnel/equipment
       -- Required for proper task completion
       -- Required to maintain proper maintenance procedures
   - Estimated frequency of incorrect performance
   - Sequencing of steps

2. Amount of time required to perform task

3. Required technical orders (job guides), tools, and equipment

4. Number of people required to perform task

5. Do first-termers typically perform task?

6. Are general maintenance procedures used, or are there unique base, MAJCOM, or
   engine procedures?

7. If interview testing is used, are there specific questions to ask that will
   identify ability to perform?


Much  of  the  additional  information  gathered  during  task  analysis  can  be  categorized  as 
logistical in nature. When decisions are being made about final testing items, and how testing
is to be accomplished, logistical information is critical to successful data collection. For
example, the amount of time required to perform a task, whether more than one person is required
for task completion, and consequences of incorrect performance (i.e., personal injury or
equipment damage) are all factors which must be considered in designing test items and selecting
the most appropriate work sample testing modality (hands-on or interview procedures).

To gather this task-centered and logistical information for the Jet Engine Mechanic WTPT,
test developers visited 12 bases. Workshops involving a total of 75 SMEs were conducted for
task analysis. This extensive task analysis process was required because tasks were selected
specific to three engine types in the specialty. Table 6 shows the bases visited, the engine
types covered, and the numbers of job experts interviewed during the Jet Engine Mechanic task
analysis.


Table 6. Bases Visited and Number of SMEs Interviewed
During Task Analysis


Air Force base           Engine               Number of SMEs

Bergstrom AFB            J-79                       10
Carswell AFB             J-57/TF-33                 11
Altus AFB                J-57/TF-33                  6
Seymour Johnson AFB      J-79                        7
Shaw AFB                 J-79                        6
Barksdale AFB            J-57                        5
Dyess AFB                J-57                        5
Blytheville AFB          J-57                        3
Travis AFB               TF-33                       5
Norton AFB               TF-33                       5
Kelly AFB                J-57/TF-33/J-79             5
Randolph AFB             J-57/TF-33/J-79             7
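
The Table 6 figures can be cross-checked against the totals stated in the text (12 bases, 75
SMEs) with a short tally. The Python sketch below transcribes the table and also counts SMEs by
engine type. The per-engine breakdown rests on an assumption that the table itself does not
confirm: every SME at a multi-engine base is counted toward each engine listed for that base.

```python
from collections import defaultdict

# (base, engine types covered, number of SMEs), transcribed from Table 6.
TABLE_6 = [
    ("Bergstrom", ("J-79",), 10),
    ("Carswell", ("J-57", "TF-33"), 11),
    ("Altus", ("J-57", "TF-33"), 6),
    ("Seymour Johnson", ("J-79",), 7),
    ("Shaw", ("J-79",), 6),
    ("Barksdale", ("J-57",), 5),
    ("Dyess", ("J-57",), 5),
    ("Blytheville", ("J-57",), 3),
    ("Travis", ("TF-33",), 5),
    ("Norton", ("TF-33",), 5),
    ("Kelly", ("J-57", "TF-33", "J-79"), 5),
    ("Randolph", ("J-57", "TF-33", "J-79"), 7),
]

n_bases = len(TABLE_6)
total_smes = sum(n for _, _, n in TABLE_6)

# Assumption: SMEs at a multi-engine base count toward every engine listed.
smes_by_engine = defaultdict(int)
for _, engines, n in TABLE_6:
    for engine in engines:
        smes_by_engine[engine] += n

print(n_bases, total_smes)
print(dict(smes_by_engine))
```

Because of the double counting at multi-engine bases, the per-engine figures sum to more than
75 and should be read only as upper bounds on the coverage of each engine type.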

An  important  added  benefit  of  these  task  analysis  base  visits  was  the  opportunity  afforded 
the  test  developers  for  equipment  and  work  site  familiarization.  Periodically,  during  the  course 
of  interviews  with  SMEs,  confusion  arose  concerning  the  placement  and  functioning  of  a  particular 
step. By visiting the work site, test developers could learn "first-hand" what was being
discussed. 

In  addition  to  these  base  visits,  a  centralized  workshop  was  held  to  gather  task  analysis 
information  for  Phase  I  tasks.  Representatives  for  all  engine  types  and  personnel  from  the 
Technical  Training  School  were  assembled  to  generate  all  required  information  to  construct  tests 
for  these  tasks. 

These  SMEs  also  assisted  test  developers  in  evaluating  tasks  from  all  phases,  specifically  in 
terms of whether certain tasks could be clustered into modules (e.g., three documentation tasks
combined  into  one  test  item). 

In summary, the objective of the task analysis process is to gather information essential to
WTPT item development. Relevant information includes beginning and ending points for each
specific task, critical steps for task accomplishment, logistical requirements for task
completion, required configuration of equipment, time-critical and safety steps, effects of local
operating procedures on task performance, and representativeness of the task. This information
is gathered by referencing applicable regulations, technical orders, and local operating
instructions, as well as through discussions with SMEs in the field. The desired result of the
analysis is a comprehensive list of steps required for successful task completion. These steps
should be generalizable to any situation in which the task might be observed and should be used
to objectively evaluate an individual's performance on the task.


WTPT Item Writing

Once the task analysis process has been completed, test developers have the information
necessary to write test items: data on logistics, feasibility of hands-on or interview testing,
and the number and criticality of steps have all been gathered. The test developer now
uses this information to organize and compose test items reflecting the steps required to perform
each task.

When  writing  items,  the  test  developer  must  decide  (based  on  SME  input)  which  steps  represent 
the behavioral elements necessary to adequately characterize that task. This may involve
eliminating  unnecessary  steps  that  add  little  to  successful  task  performance  (and  little  to  the 
discriminating  power  of  the  item).  Similarly,  certain  lengthy  tasks  may  consist  of  several 
subtasks,  one  or  more  of  which  is  representative  of  the  behavioral  requirements  of  the  entire 
task.  In  such  cases,  an  item  can  be  constructed  utilizing  one  subtask  with  similar  properties 
and  task  difficulty  levels.  Specific  criteria  used  in  selecting  subtasks  are  listed  in  Table  7. 

Table  7.  Criteria  to  Select  Measurable  Subtasks 


Representative  of  the  general  task? 

Discriminator  between  a  good  and  bad  performer? 

Measurable  using  job  performance  methodology? 

Capable  of  being  performed  in  a  testing  situation? 

Capable  of  being  completed  within  the  testing  time  limitations? 

Frequently performed by first-term airmen?

Another important component of this process is an evaluation of whether hands-on or interview
items should be written to represent a task. As Hedge noted earlier, the WTPT methodology calls
for using interview testing when time, cost, or safety considerations suggest hands-on testing is
impractical. In the present effort, because interview testing was to be evaluated as a potential
surrogate for hands-on testing, both hands-on and interview items were developed for some tasks.

As shown in Table 2 of Hedge's paper, hands-on and interview items were constructed such that six
tests (3 engine types x 2 functional areas) were compiled, with each test consisting of 10
hands-on and 10 interview items. In all, 59 unique WTPT items were constructed for the Jet
Engine Mechanic specialty (i.e., 8 for Phase I, 21 for Phase II, and 30 for Phase III). Each
item was then reviewed and refined by three to four groups of SMEs.

Pilot  Testing 

Once  items  had  been  written,  reviewed,  and  gathered  into  test  booklets,  pilot  tests  were  set 
up  at  three  Air  Force  bases  (one  per  engine  type).  Pilot  testing  served  multiple  purposes. 
Items were evaluated to ensure that all critical steps were included and were sequentially
correct, that items were applicable to first-term airmen, etc. Booklets were also assessed in
terms of clarifying instructions, test booklet layout, and inclusion of applicable technical
orders or job guides. In addition, task setup and equipment requirements were checked, as well
as base support required (e.g., personnel/equipment availability). Equipment use and safety
issues were also identified. Finally, time to complete all tasks in the WTPT was examined to be
sure the total test could be completed within the time allotted. Thus, information gained from
the pilot test allowed appropriate revisions to instruments and procedures in preparation for
data collection.


Lessons  Learned 


Working through the development of performance measures for the Jet Engine Mechanic specialty
yielded valuable insights about the adequacy of the developmental process.

Field experience indicated that SME judgments during the task selection workshop, as
described by Lipscomb, were correct and invaluable. Consequently, it is recommended that the
task selection plan review and modification be based on SME analysis of tasks (rather than total
reliance on occupational survey task data). Such input can provide important information about
how specific or how general task statements are, as well as how much procedures vary because of
equipment or environment. This helps to identify tasks having the same level of specificity, or
modules of work which first-term airmen typically perform.

SME input is also essential during the task analysis process. It is improbable that test
developers will possess the knowledge and skills necessary to understand the intricacies of the
specialty for which items are being written. With the use of SMEs, in fact, test developer
naivete becomes a benefit, in that SMEs are forced to explain task procedures in detail. In
addition, SME input during task analysis and item development will assist test developers in
identifying modules of work performed. This grouping of tasks will facilitate the development
and testing process.

In summary, the task analysis and item writing process is a lengthy, yet necessary component
in constructing work sample tests. The procedures used in task analysis and item writing were
described, and the information gathered at each point was detailed. Throughout the process, the
use of SMEs was shown to be essential to developing an accurate, well-constructed work sample
test.


SOME COMMENTS ON WALK-THROUGH PERFORMANCE TESTING

Terry L. Dickinson
Old Dominion University


My  remarks  will  primarily  be  directed  to  support  the  Air  Force  in  its  attempts  to  develop  a 
program  for  job  performance  measurement.  I  am  encouraged  by  the  effort  to  implement 
state-of-the-art  technologies  in  performance  measurement  and,  also,  by  the  concern  to  approach 
this  implementation  with  an  attitude  of  research-mindedness.  I  believe  that  applications  in 
psychology  often  lead  to  the  identification  of  new  research  issues  as  well  as  the  resolution  of 
old ones. It is with these thoughts in mind that I will discuss each of the papers in this
symposium. 

Dr. Gould and Dr. Hedge have indicated that the Air Force is developing its program of WTPT
to address the "criterion problem." Performance measures constructed with this technology will
serve  as  high-fidelity  benchmarks  against  which  other  measures  such  as  peer  and  supervisory 
ratings  will  be  compared  for  substitutability.  For  many  jobs  and  their  tasks,  WTPT  would  appear 
to  provide  these  benchmarks.  In  particular,  tasks  that  result  in  products  (Osburn,  1973)  are 
most  appropriate  for  comparing  ratings  and  WTPT  measures.  Assuming  that  the  tasks  have  been 
sampled  to  represent  critical  requirements  of  job  performance,  the  fidelity  of  WTPT  measures 
cannot  be  questioned.  Such  measures  will  truly  be  work  samples,  and  performance  on  these  samples 
can  be  referenced  to  job  content. 

However,  for  jobs  and  their  tasks  that  do  not  have  products  directly  linked  to  performance, 
it  is  questionable  whether  WTPT  can  provide  measures  that  can  logically  be  thought  of  as 
high-fidelity  benchmarks.  For  many  jobs,  the  process  of  performance  is  as  important  as  the  final 
product  of  performance,  and  for  some  jobs,  the  product  is  the  process  of  performance.  Work 
samples  could  be  constructed  for  these  jobs.  However,  performance  on  the  samples  would 
necessarily  involve  observation  by  judges  and  the  rating  of  that  performance.  Since  there  may  be 
several  acceptable  ways  to  perform  job  tasks,  the  work  samples  must  be  constructed  very  carefully 
to  allow  for  the  full  range  of  acceptable  performance.  Of  course,  it  may  be  impossible  to 
construct  work  samples  for  some  jobs  that  will  totally  simulate  the  range  of  acceptable 
performance.  Part  of  the  Air  Force's  research  efforts  should  be  directed  toward  exploring  the 
product  versus  process  description  of  tasks  in  terms  of  identifying  jobs  for  which  WTPT  can 
provide  high-fidelity  benchmarks. 

Another  goal  of  WTPT  is  to  operationalize  the  test  development  and  performance  measurement 
procedures  such  that  they  can  be  conducted  by  technicians.  Clearly,  training  programs  need  to  be 
developed  so  that  technicians  can  (a)  construct  task  sampling  plans  and  execute  these  plans,  (b) 
construct  the  tests  and  their  administration  procedures,  and  (c)  administer  the  tests  at  a 
variety of sites. The implementation of these programs suggests an interface between personnel
psychology  and  organizational  development  activities.  I  believe  that  lessons  can  be  learned  from 
the  technology  of  organizational  development  that  can  aid  in  the  efficient  accomplishment  of  this 
goal.  Finally,  these  training  programs  will  need  to  be  evaluated  rigorously  to  ensure  that  the 
WTPT  measures  that  are  produced  by  technicians  meet  acceptable  professional  standards. 

I like the fact that a conceptual model of performance measurement is being used to guide
WTPT. This model clearly suggests the appropriateness of different methods and sources for
measuring different job content domains. For example, self-ratings are assumed to be best for
measuring technical skills; peer ratings are better for interpersonal/social skills; and
supervisory ratings are good for both types of skills. Continued iteration and refinement of the
model should be an important goal in the Air Force's R&D program. Refinement of the model will
ensure that the program is conceptually driven. For example, the current application of WTPT to
jet engine mechanics is assessing the comparability of the hands-on component and the interview
component for describing performance on a subset of tasks. The selection of these tasks has been
driven by practical considerations such as time to perform, safety, and potential damage to
equipment. These are important considerations. However, I suggest that the Air Force attempt to
identify  the  task  attributes  that  specify  the  degree  of  interchangeability  of  hands-on  and 
interview  components.  A  key  for  identifying  these  task  attributes  may  be  the  assumption  that  job 
knowledge  (i.e.,  what  to  do)  implies  job  proficiency  (i.e.,  can  do).  Research  should  identify 
the  task  attributes  that  weaken  or  strengthen  the  relationship  between  job  knowledge  and  job 
proficiency. 

Dr.  Hedge  described  the  methodology  of  WTPT.  He  noted  that  a  task  sampling  plan  is  used  to 
define  the  job  content  domain  from  which  tasks  are  sampled  to  develop  work  sample  tests.  Here,  I 
think  it  is  important  to  make  the  distinction  between  a  job  content  domain  and  a  job  content 
universe.  As  Guion  (1979)  noted,  the  job  content  universe  includes  all  the  nontrivial  tasks, 
responsibilities,  and  organizational  relationships  inherent  in  a  job.  The  job  content  domain  is 
a  subset  of  the  universe  and  consists  of  that  portion  of  the  universe  that  is  identified  for  the 
purposes  of  testing. 

Dr. Hedge noted that as the Air Force moves into more abstract specialties (e.g., supervisory
or administrative), the job content universes may be less amenable to WTPT. The Air Force should
consider developing indexes for jobs that describe the percentage of tasks that can be measured
by  WTPT  technology.  These  indexes  could  be  useful  for  constructing  a  job  classification  system 
that  predicts  the  amenability  of  jobs  to  various  performance  measurement  technologies. 

Ms.  Lipscomb  described  the  task  sampling  plan  that  the  Air  Force  used  to  develop  WTPT 
measures  for  jet  engine  mechanics.  In  this  plan,  a  wealth  of  information  was  available  for 
sampling  the  job  content  domain;  e.g.,  general  job  description,  job  training  requirements,  and 
occupational  analysis  information.  Several  parameters  were  defined  for  sampling  the  tasks  from 
that domain, and they included being part of the plan of instruction at the training school,
performed by at least 30% of first-term incumbents, and part of a duty cluster. This task
sampling plan is impressive in terms of its specificity and detail. The plan certainly
exemplifies  the  content-oriented  approach  to  test  construction. 

Several  research  questions  come  to  mind  about  the  task  sampling  plan.  First,  is  the  extreme 
detail  in  the  plan  necessary?  Furthermore,  are  the  particular  levels  of  the  parameters 
important?  Would  essentially  the  same  tasks  have  been  selected  with  less  detailed  plans  or 
different  parameter  levels?  I  propose  that  a  sensitivity  analysis  (Fischoff,  1980)  be  conducted 
on  the  task  sampling  plan. 
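
As a rough illustration of what such an analysis could look like, the Python sketch below varies
a single parameter (the percent-performing cutoff) and measures how much the selected task set
overlaps with the set chosen at the baseline cutoff. This is a simple one-parameter check, not
necessarily the procedure in the cited reference. The survey values, task labels, and function
names are hypothetical.

```python
def select_tasks(percent_performing, cutoff):
    """Tasks performed by at least `cutoff` percent of incumbents."""
    return {task for task, pct in percent_performing.items() if pct >= cutoff}


def jaccard(a, b):
    """Overlap of two task sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cutoff_sensitivity(percent_performing, baseline_cutoff, alternative_cutoffs):
    """Overlap between the baseline selection and each alternative selection."""
    baseline = select_tasks(percent_performing, baseline_cutoff)
    return {
        cutoff: jaccard(baseline, select_tasks(percent_performing, cutoff))
        for cutoff in alternative_cutoffs
    }


# Hypothetical percent-performing values for five tasks.
survey = {"A": 82.0, "B": 41.0, "C": 33.0, "D": 27.0, "E": 12.0}

overlap = cutoff_sensitivity(survey, 30.0, [20.0, 30.0, 40.0])
for cutoff, score in overlap.items():
    print(f"cutoff {cutoff:.0f}%: overlap with baseline = {score:.2f}")
```

An overlap near 1.0 across a range of cutoffs would suggest that the exact parameter level
matters little for that specialty. A sharp drop would suggest the opposite.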

We  know  too  little  about  the  plans  and  parameters  that  are  used  to  sample  tasks  for 
content-oriented  test  construction.  If  less  costly  and  time-consuming  plans  can  be  used  to 
identify  tasks,  they  should  be  used. 

Furthermore, sensitivity analyses that are done across several jobs may indicate that the
parameters and complexity of the plans vary across specialties and job types. Such knowledge
would be quite useful for expanding the theory and technique of content-oriented test
construction. 

Lt Col Ballentine described the lessons that were learned from the procedures used for the
construction of WTPT measures for jet engine mechanics. Although extensive task information was
available, it was not sufficient for test construction. SMEs were interviewed to check the
appropriateness of the tasks for test construction. Some tasks were too complex for testing and
needed to be subdivided, whereas other tasks were too trivial and were ignored. This result
underscores the generic nature of job analytic information and indicates the importance of
adapting the information to meet its intended use.

Additional reviews by SMEs at different Air Force bases identified procedural differences
among  the  bases  in  how  tasks  were  accomplished.  Tasks  were  not  considered  for  test  construction 
if  the  differences  had  to  be  included  as  steps  in  the  work  samples.  Procedural  differences  bear 
on the generalizability of the test scores and should serve as a red flag. Ideally, WTPT
measures  should  be  assessed  for  location  differences  in  task  procedures  at  all  Air  Force  bases. 
Each  step  in  the  measures  must  be  applicable  to  every  job  incumbent.  Otherwise,  situational 
effects  will  account  for  some  unknown  proportion  of  variance  in  test  performance.  Clearly, 
assessing  location  differences  at  all  bases  is  economically  unfeasible.  However, 
Generalizability Theory (Cronbach, Gleser, Nanda, & Rajaratnam, 1972) could be used to ascertain
the  number  of  bases  necessary  to  reasonably  estimate  the  proportion  of  variance  in  test 
performance  that  is  accounted  for  by  procedural  differences. 

Finally,  a  concern  that  I  have  for  WTPT  is  the  quality  of  test  administration.  Test 
administrators  must  be  trained  in  how  to  collect  data  with  considerable  accuracy.  Unfortunately, 
we  know  too  little  about  accuracy  training.  Furthermore,  the  limited  body  of  research  suggests 
that rating accuracy is not the same as observation accuracy (see Murphy, Garcia, Kerkar, Martin,
& Balzer, 1982). Clearly, both rating accuracy and observational accuracy will be needed in
WTPT. Depending on the job, WTPT measures can require the administrators to rate and observe
test performance. The Air Force should give strong consideration to devoting some of its efforts
to R&D in accuracy training.